Pruning False Unknown Words to Improve Chinese Word Segmentation
Authors
Abstract
During unknown word detection in Chinese word segmentation, many of the detected word candidates are invalid. These false unknown word candidates lower the overall segmentation accuracy because they also degrade the segmentation of known words. We therefore propose a pruning process that eliminates as many invalid word candidates as possible. Our experiments show that cutting down the invalid unknown word candidates improves the segmentation accuracy of known words and, in turn, the overall segmentation accuracy.
Similar resources
Chinese and Japanese Word Segmentation Using Word-Level and Character-Level Information
In this paper, we present a hybrid method for Chinese and Japanese word segmentation. Word-level information is useful for analyzing known words, while character-level information is useful for analyzing unknown words; the method uses both types of information to handle known and unknown words effectively. Experimental results show that this method achieves high o...
A Hybrid Model for Chinese Word Segmentation
This paper describes a hybrid model that combines machine learning with linguistic and statistical heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two major components: a tagging component that annotates each character in a Chinese sentence with a position-of-character (POC) tag that indicates its position in a word, and a merging com...
Segmenting Chinese Unknown Words by Heuristic Method
Chinese text segmentation is important in Chinese text indexing. Due to the lack of word delimiters in Chinese text, Chinese text segmentation is more difficult than English text segmentation. Besides, the segmentation ambiguities and the occurrences of out-of-vocabulary words (i.e. unknown words) are the major challenges in Chinese segmentation. Many research works dealing with the problem of ...
Combining Machine Learning with Linguistic Heuristics for Chinese Word Segmentation
This paper describes a hybrid model that combines machine learning with linguistic heuristics for integrating unknown word identification with Chinese word segmentation. The model consists of two components: a position-of-character (POC) tagging component that annotates each character in a sentence with a POC tag that indicates its position in a word, and a merging component that transforms a P...
Hybrid Models for Chinese Unknown Word Resolution Dissertation
Word segmentation, part-of-speech (POS) tagging, and sense tagging are important steps in various Chinese natural language processing (CNLP) systems. Unknown words, i.e., words that are not in the dictionary or training data used in a CNLP system, constitute a major challenge for each of these steps. This dissertation is concerned with developing hybrid models that effectively combine statistic...